Bayesian model assessment

Mike Irvine

Introduction

The Bayesian approach is “the explicit use of external evidence in the design, monitoring, analysis, interpretation and reporting of a (scientific investigation)” (Spiegelhalter, 2004).

A Bayesian is one who, vaguely expecting to see a horse and catching a glimpse of a donkey, strongly concludes he has seen a mule. (Senn, 1997)

Bayesian primer

  • In the maximum likelihood paradigm we try to maximize Pr(\text{Data}|\text{Parameters}), the probability of having observed the data given some fixed parameters
  • For example, I sample (ask) three people whether they like raisin cookies. One person answers yes and two people answer no.
  • Under maximum likelihood the estimated prevalence of liking raisin cookies in the population is 1/3.
  • What if we wanted to ask whether the probability of people in the population liking raisin cookies is greater than one half? Unfortunately we can’t answer these types of questions under maximum likelihood
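We can verify the maximum likelihood estimate numerically. A minimal base-R sketch (the binomial likelihood and the one-yes-out-of-three sample come from the example above):

```r
# Log-likelihood of observing 1 "yes" out of 3 as a function of prevalence p
loglik <- function(p) dbinom(1, size = 3, prob = p, log = TRUE)

# Maximize numerically: the maximum sits at the sample proportion 1/3
mle <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
round(mle, 3)  # approximately 0.333
```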

Bayesian primer

  • Instead we can ask for the probability of a parameter having a particular value given some data, or Pr(\text{Parameters}|\text{Data})
  • Mathematically we rewrite this in terms of the likelihood and the prior. The prior is the probability of a parameter before observing the data and represents previous scientific and expert knowledge

Bayesian primer

  • Think about this in terms of log-probabilities: \log Pr(\text{Parameters}|\text{Data}) = \log Pr(\text{Data}|\text{Parameters}) + \log Pr(\text{Parameters}) + C
  • If the prior probability is low then its log is much less than zero.
  • However, if the evidence (likelihood) is high it counteracts the low prior probability
  • The result is a probability statement about our value of interest given the evidence we’ve observed and previous knowledge.
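Returning to the raisin-cookie example, the question maximum likelihood could not answer becomes a one-liner. A minimal sketch assuming a conjugate uniform Beta(1, 1) prior (my choice for illustration, not stated in the slides):

```r
# 1 "yes" and 2 "no" answers with a Beta(1, 1) prior give a Beta(2, 3) posterior
a_post <- 1 + 1  # prior alpha + number of yes answers
b_post <- 1 + 2  # prior beta + number of no answers

# Pr(prevalence > 0.5 | data)
pr_gt_half <- pbeta(0.5, a_post, b_post, lower.tail = FALSE)
pr_gt_half  # 0.3125
```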

Example of Bangladesh Wells

  • The data are for an area of Arahazar upazila, Bangladesh. The researchers labelled each well with its level of arsenic and an indication of whether the well was “safe” or “unsafe.”
  • Those using unsafe wells were encouraged to switch.
  • After several years it was determined whether each household using an unsafe well had changed its well.
  • These data are used by Gelman and Hill (2007) for a logistic-regression example.

Data exploration

Potential first model

flowchart LR

subgraph Priors
  mean0[Mean = 0]
  sd10[Standard Deviation = 10]
  np[Normal Prior]
  mean0 --> np
  sd10 --> np
end

np --> A
np --> C
A[distance] --> B(inv-logit probability)
C[arsenic] --> B

subgraph Likelihood
  B --> D{Switch}
end
  • Two covariates named arsenic and distance
  • Both covariates have a normal prior with mean 0 and standard deviation 10
  • The inverse logit probability for each observation is a linear combination of the above
  • The outcome labeled switch is a Bernoulli trial with probability being defined above

Generate first model

bprior <- c(
  prior_string("normal(0,10)", coef = "distance", class = "b"),
  prior_string("normal(0,10)", coef = "arsenic", class = "b")
)
wells_data <- Wells %>% mutate(switch_numeric = if_else(switch == "yes", 1, 0))
fit <- brm(switch_numeric ~ distance + arsenic,
  data = wells_data, family = bernoulli(),
  prior = bprior, chains = 2, iter = 1000, cores = 2
  )
fit %>% saveRDS(here::here(
  "model_assessment_talk", "rds",
  "first_model_posterior.rds"
))

First step: sanity checks

summary(fit, priors = TRUE)
 Family: bernoulli 
  Links: mu = logit 
Formula: switch_numeric ~ distance + arsenic 
   Data: wells_data (Number of observations: 3020) 
  Draws: 2 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 1000

Priors: 
b_arsenic ~ normal(0,10)
b_distance ~ normal(0,10)
Intercept ~ student_t(3, 0, 2.5)

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.01      0.08    -0.14     0.17 1.00      654      794
distance     -0.01      0.00    -0.01    -0.01 1.00     1075      689
arsenic       0.46      0.04     0.37     0.55 1.01      457      414

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
  • Does the structure of the model conform with our understanding?
  • Do the model estimates make sense? Could they be negative, or extend beyond the range indicated by the credible interval?
  • Are observations generated from the posterior meaningful?
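Some of these checks can be scripted. A hedged sketch using brms’ diagnostic accessors (assumes `fit` is the fitted model saved above; argument names follow recent brms versions):

```r
library(brms)

# Convergence: potential scale reduction should be close to 1 for all parameters
rhats <- rhat(fit)
any(rhats > 1.05, na.rm = TRUE)  # TRUE would flag possible non-convergence

# Sampling efficiency: low ratios suggest running more iterations
neff_ratio(fit)

# A quick look at observations generated from the posterior
head(posterior_predict(fit, ndraws = 10))
```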

Conditional effects

Prior predictive check

flowchart LR

subgraph Priors
  mean0[Mean = 0]
  sd10[Standard Deviation = 10]
  np[Normal Prior]
  mean0 --> np
  sd10 --> np
end

np --> A
np --> C
A[distance] --> B(inv-logit probability)
C[arsenic] --> B

subgraph Likelihood
  B --> D{Switch}
end
  • We are using a generative data model, so we can always sample parameters and observations
  • We can decompose the model into prior and likelihood. By pretending we haven’t observed any data, we can understand how the prior influences the resulting sampled observations

Prior predictive check to determine appropriate scale Wells example

bprior <- c(
  prior_string("normal(0,10)", coef = "distance", class = "b"),
  prior_string("normal(0,10)", coef = "arsenic", class = "b")
)
fit_prior <- brm(switch_numeric ~ distance + arsenic,
  data = wells_data,
  family = bernoulli(), prior = bprior,
  sample_prior = "only", chains = 2, iter = 1000, cores = 2
)
fit_prior %>% saveRDS(here::here(
  "model_assessment_talk", "rds",
  "first_model_prior.rds"
))

Prior predictive check - arsenic

Prior predictive check - distance

Prior predictive check - switching

Prior predictive check

  • Analogy: checking over your shoulder before merging
  • Can determine whether priors provide too much or too little flexibility
  • Can spot issues in the data, for example covariates on very different scales
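The “too much flexibility” point can be seen without fitting anything. A base-R sketch: sample coefficients from the normal(0, 10) prior and push a typical well (hypothetical covariate values, chosen for illustration) through the inverse logit:

```r
set.seed(1)
n_draws <- 1000

# Draw both coefficients from the normal(0, 10) prior
beta_distance <- rnorm(n_draws, 0, 10)
beta_arsenic  <- rnorm(n_draws, 0, 10)

# A hypothetical well: 50 m to the nearest safe well, arsenic level 1.5
eta <- beta_distance * 50 + beta_arsenic * 1.5
p   <- plogis(eta)  # inverse logit

# On this unscaled covariate, almost every prior draw implies a switching
# probability of essentially 0 or 1 -- the prior is far too wide
mean(p < 0.01 | p > 0.99)
```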

Assessment for a linear model

Linear model - prior predictive check

Assessing Model Performance

  • Information criterion approaches (e.g. WAIC)
  • Cross-validation using a k-fold approach
  • Cross-validation using leave-one-out
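In brms each of these is a one-line call on the fitted model. A sketch (assumes `fit` from the wells model; `kfold()` refits the model K times, so it is by far the slowest):

```r
library(brms)

waic_fit  <- waic(fit)           # widely applicable information criterion
kfold_fit <- kfold(fit, K = 10)  # exact 10-fold cross-validation
loo_fit   <- loo(fit)            # approximate leave-one-out cross-validation
                                 # via Pareto-smoothed importance sampling
```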

Leave-one-out cross-validation example


Computed from 1000 by 3020 log-likelihood matrix

         Estimate   SE
elpd_loo  -1968.6 15.7
p_loo         3.4  0.1
looic      3937.2 31.5
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

LOO - linear example

Cross-validation for time series: Leave-Future-Out (LFO)

Posterior predictive check

flowchart LR

subgraph Priors
  mean0[Mean = 0]
  sd10[Standard Deviation = 10]
  np[Normal Prior]
  mean0 --> np
  sd10 --> np
end

np --> A
np --> C
A[distance] --> B(inv-logit probability)
C[arsenic] --> B

subgraph Likelihood
  B --> D{Switch}
end
  • Posterior combines information from the prior and information from the data (likelihood)
  • Data should look “uninteresting” when compared to samples from the posterior
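With brms the posterior predictive comparison is a single call. A sketch (assumes `fit` from above; `type = "bars"` is one of several bayesplot display types and suits a binary outcome):

```r
library(brms)

# Overlay draws from the posterior predictive distribution on the data
pp_check(fit, ndraws = 100)

# For a binary outcome, counts of 0s and 1s are easier to read
pp_check(fit, type = "bars", ndraws = 100)
```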

Linear example of posterior predictive check

Linear model - posterior predictive check

Sensitivity Analysis

  • Sensitivity analysis plays an important role in Bayesian modeling
  • Varying priors and model parameters helps assess the robustness of model predictions
  • Example: investigating how sensitive estimated disease transmission rates are to these choices
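As a toy illustration (reusing the raisin-cookie example rather than a transmission model), we can check how sensitive the posterior answer is to the choice of prior:

```r
# Pr(prevalence > 0.5 | 1 yes, 2 no) under a conjugate Beta(a, b) prior
pr_under_prior <- function(a, b) pbeta(0.5, a + 1, b + 2, lower.tail = FALSE)

pr_under_prior(1, 1)  # uniform prior
pr_under_prior(5, 5)  # stronger prior centred on 0.5 pulls the answer upwards
```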

Generating and interpreting residuals for a Bayesian model

  • Residuals can be generated in a similar way to maximum likelihood models
  • Residuals come with uncertainty, due to the uncertainty in the observation process
  • Residuals can similarly be used to assess model fit, accuracy, bias, or misspecification
  • They additionally provide a way of generating Bayesian “p-values”.
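A minimal simulated sketch of residuals with uncertainty and a Bayesian “p-value” (a toy normal-mean model stands in for the full regression; all values here are simulated, not from the wells data):

```r
set.seed(42)
y <- rnorm(50, mean = 2, sd = 1)  # simulated observations

# Approximate posterior draws for the mean (flat prior, plug-in sd)
mu_draws <- rnorm(4000, mean(y), sd(y) / sqrt(length(y)))

# One replicated data set per posterior draw (columns = draws)
y_rep <- sapply(mu_draws, function(m) rnorm(length(y), m, sd(y)))

# Residuals inherit the posterior uncertainty: one residual vector per draw
resid_draws <- y - y_rep  # y is recycled down each column

# Bayesian "p-value" for a test statistic, here the sample maximum
p_value <- mean(apply(y_rep, 2, max) >= max(y))
```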

Bayesian residuals- Linear example

Bayesian p-value Q-Q plot

Data mining the residuals

  • A wide variety of data mining algorithms are in use
  • There is a large debate about their use in process modeling and forecasting
  • Potentially useful for generating hypotheses about when/where the model fails
  • Potential approaches
    • CART
    • GAM
    • Random Forests
    • Boosted regression trees
    • Artificial Neural Network
    • Support Vector Machines
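A hedged sketch of the CART option using the `rpart` package (`wells_resid_df` is a hypothetical helper data frame built from the earlier `fit` and `wells_data` objects):

```r
library(rpart)

# Mean posterior residual per observation, alongside its covariates
wells_resid_df <- data.frame(
  resid    = residuals(fit)[, "Estimate"],
  distance = wells_data$distance,
  arsenic  = wells_data$arsenic
)

# A shallow regression tree on the residuals can point to regions of
# covariate space where the model systematically misfits
tree <- rpart(resid ~ distance + arsenic, data = wells_resid_df,
              control = rpart.control(maxdepth = 3))
```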

Conclusion

  • Bayesian model construction and development is similar to the maximum likelihood approach
  • Check model reasoning, under- or over-fitting, bias, and misspecification
  • In addition, we need to check the reasoning behind the prior